The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35).
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.
The quantification of alternative splicing was performed using VAST-TOOLS (Tapial et
al., 2017). The data required for this practical session can be
downloaded from Zenodo
(please note this is a different version of the one in the
previous module).
load("C:/Users/norar/OneDrive/Documentos/EPFL/Genomics and bioinformatics/Week 8/AS_GE_data.RData")
#Data:
# info : data-frame of information for the nuclear samples
# myE_gen : normalized gene expression count matrix of the CTRL nuclear fraction quantile-normalized
# dat_diff_time_nuc_ctrl : data-frame with different AS analysis for the nuclear compartment
# dat_diff_time_cyto_ctrl : data-frame with different AS analysis for the cytoplasmic compartment
# tab_psi : data frame with all events in VAST-TOOLS and their PSI values in each sample (`dat_diff_time_nuc_ctrl` and `dat_diff_time_cyto_ctrl` are data-frame containing the 'average' changes over time in splicing while `tab_psi` contains the PSI values for each individual samples)
#PSI (Percentage spliced in) is the ratio of normalized read counts indicating inclusion of a transcript element over the total normalized reads for that event (both inclusion and exclusion reads). A PSI value of 0.8 for an exon skip event would indicate that the exon is included in approximately 80% of the transcripts in the sample. Changes in average PSI values when comparing groups of samples indicate a shift in splicing patters between the groups or a splice event.
#Here I make some nice colors to facilitate the visualisation
mygroup <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}
#Create a venn diagram from two gene lists
## This is an example - not meant to run
vennDiag <- function(genes_lists){
genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
colnames(genes_comparisons)<-c("cond1","cond2")
vennDiagram(genes_comparisons,main="blasdfas")
}
#Plot PSI events over time in nuclear cyto fractions
plotEventOverTime <- function(event="HsaINT0002504"){
toi <- c("E.dPsi.d_3_0","E.dPsi.d_7_0","E.dPsi.d_14_0","E.dPsi.d_22_0","E.dPsi.d_35_0")
dat_nuc <- dat_diff_time_nuc_ctrl[match(event,dat_diff_time_nuc_ctrl$EVENT),match(toi,colnames(dat_diff_time_nuc_ctrl))]
dat_cyto<- dat_diff_time_cyto_ctrl[match(event,dat_diff_time_cyto_ctrl$EVENT),match(toi,colnames(dat_diff_time_cyto_ctrl))]
tempdat <- rbind(as.numeric(dat_nuc),as.numeric(dat_cyto))
rownames(tempdat)<-c("nuc","cyto")
colnames(tempdat)<- c("3","7","14","22","35")
tempdat[is.na(tempdat)]<-0
barplot(tempdat,beside=TRUE,las=1,ylab="dPSI",xlab="DIV",main=paste(tab_psi$GENE[match(event,tab_psi$EVENT)], "-", tab_psi$COORD[match(event,tab_psi$EVENT)]),col=c("#CC9C3D","#476DB4"))
legend("topleft",ncol=1,pch=15,col=c("#CC9C3D","#476DB4"),leg=c("nuc","cyto"),cex=0.7)
grid()
}
#To following function sort myvec in decreasing order and return their indexes:
myidx<- sort(x=c(1:2000),index.return=TRUE,decreasing=TRUE)$ixdat_diff_time_nuc_ctrl data-table) and cytoplasmic
fractions (dat_diff_time_cyto_ctrl). The schematic below
shows the four main types of events we are going to look at.
Here is a summary of the column content which you can find at on Github repo of VAST-TOOL:
First you always need to make sure you understand the format of your
data, the content of the columns and the rows. Start by checking the
dimensions of the data using dim, nrow,
ncol functions. Then if the data-count table is not too
large, you can use the View function to visualise and
explore its content. Alternatively you can print a couple
of rows. Here you need to understand the output from VAST-TOOLS.
## [1] 365711 102
## [1] "GENE" "EVENT" "COORD" "LENGTH" "FullCO" "COMPLEX" "0_CB1D_cyto" "0_CB1D_nuc" "0_CB1E_cyto"
## [10] "0_CB1E_nuc" "0_CTRL1_cyto" "0_CTRL1_nuc" "0_CTRL3_cyto" "0_CTRL3_nuc" "0_CTRL4_cyto" "0_CTRL4_nuc" "0_CTRL5_cyto" "0_CTRL5_nuc"
## [19] "0_GliA_cyto" "0_GliA_nuc" "0_GliB_cyto" "14_CB1D_cyto" "14_CB1D_nuc" "14_CB1E_cyto" "14_CB1E_nuc" "14_CTRL1_cyto" "14_CTRL1_nuc"
## [28] "14_CTRL3_cyto" "14_CTRL3_nuc" "14_CTRL4_cyto" "14_CTRL4_nuc" "14_CTRL5_cyto" "14_CTRL5_nuc" "14_GliA_cyto" "14_GliA_nuc" "14_GliB_cyto"
## [37] "14_GliB_nuc" "22_CB1D_cyto" "22_CB1D_nuc" "22_CB1E_cyto" "22_CB1E_nuc" "22_CTRL1_cyto" "22_CTRL1_nuc" "22_CTRL3_cyto" "22_CTRL3_nuc"
## [46] "22_CTRL4_cyto" "22_CTRL4_nuc" "22_CTRL5_cyto" "22_CTRL5_nuc" "22_GliA_cyto" "22_GliA_nuc" "22_GliB_cyto" "22_GliB_nuc" "35_CB1D_cyto"
## [55] "35_CB1D_nuc" "35_CB1E_cyto" "35_CB1E_nuc" "35_CTRL1_cyto" "35_CTRL1_nuc" "35_CTRL3_cyto" "35_CTRL3_nuc" "35_CTRL4_cyto" "35_CTRL4_nuc"
## [64] "35_CTRL5_cyto" "35_CTRL5_nuc" "35_GliA_cyto" "35_GliA_nuc" "35_GliB_cyto" "35_GliB_nuc" "3_CB1D_cyto" "3_CB1D_nuc" "3_CB1E_cyto"
## [73] "3_CB1E_nuc" "3_CTRL1_cyto" "3_CTRL1_nuc" "3_CTRL3_cyto" "3_CTRL3_nuc" "3_CTRL4_cyto" "3_CTRL4_nuc" "3_CTRL5_cyto" "3_CTRL5_nuc"
## [82] "3_GliA_cyto" "3_GliA_nuc" "3_GliB_cyto" "3_GliB_nuc" "7_CB1D_cyto" "7_CB1D_nuc" "7_CB1E_cyto" "7_CB1E_nuc" "7_CTRL1_cyto"
## [91] "7_CTRL1_nuc" "7_CTRL3_cyto" "7_CTRL3_nuc" "7_CTRL4_cyto" "7_CTRL4_nuc" "7_CTRL5_cyto" "7_CTRL5_nuc" "7_GliA_cyto" "7_GliA_nuc"
## [100] "7_GliB_cyto" "7_GliB_nuc" "COMPLEX_short"
## [1] 365711 21
## [1] "GENE" "EVENT" "COORD" "LENGTH" "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3" "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22" "COMPLEX_short"
## [1] 365711 21
## [1] "GENE" "EVENT" "COORD" "LENGTH" "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3" "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22" "COMPLEX_short"
It is generally good to get an understanding of the type of events which are differentially spliced between two conditions. Pie-charts can be useful to visualise the relative abundances of events in each categories between different types of conditions. \(\Delta\)PSI>0 indicates inclusion while \(\Delta\)PSI<0 indicates skipping between two conditions. The default threshold of a difference in PSI is 15%.
Letās start by looking at the relative number of events with a threshold difference of 20% between the iPSC stage (DIV=0) and the MN stage (DIV=35) in the nucleus:
col_events=c("#CD4599","#8C8C8C","#6ABD45","#4B5493","#FEC20F")
names(col_events) <- names(table(dat_diff_time_nuc_ctrl$COMPLEX_short))
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="NUCLEUS -- DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)
And compare these with event distribution in cytoplasmic fractions:
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="CYTOPLASM --DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)
What are the key observations?
Having found some differences at MN stages in relative distributions of included events between cyto and nuc (enrichment in nuclear fraction of IR), letās now test whether splicing patterns are similar in nucleus versus cytoplasm over time.
pcc_events<- do.call(what=cbind,args=lapply(unique(dat_diff_time_cyto_ctrl$COMPLEX_short),function(EV)return(unlist(lapply(c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0"),function(coln)return(cor(dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_nuc_ctrl))],dat_diff_time_cyto_ctrl[dat_diff_time_cyto_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_cyto_ctrl))],use = "complete")))))))
colnames(pcc_events)<- unique(dat_diff_time_cyto_ctrl$COMPLEX_short)
rownames(pcc_events)<-c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0")
par(mfrow=c(1,1))
barplot(pcc_events,beside=TRUE,col=unlist(lapply(colnames(pcc_events),function(Z)return(rep(col_events[match(Z,names(col_events))],5)))),las=1,ylim=c(0,1),ylab="Pearson Correlation Coefficient")
Letās now visualise these results, in particular the changes between NPC and MN stages in terms of alternative exon usage and intron retention:
par(mfrow=c(2,2),mar=c(4,4,4,4))
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
grid()
abline(0,1,col="red",lty=2)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
If interested have a look at the role of nuclear detention of intron-retaining transcripts.
While hierarchical clustering might be dominated by some changes in few genes, other methods such as singular value decomposition (SVD) analysis can be instrumental in identifying independent biological pathways underlying changes in gene expression. It is also very appropriate for time-series analysis when variation is not linearly dependent on time.
#This function extract the fraction of variance per component as derived from SVD analysis
#Input= SVD object
getFractionVariance<- function(mySVD){
return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}
#This function calculate the Shannon Entropy which indicates how well the information is spread among the principal components.
#High value indicates information evenly distributed among components
#Low value indicates most information/variance is captured by a single component
#Takes as input the fraction of variance outputted from getFractionVariance
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))
#This function plots the fraction of variance captured by first N components.
#pi=fraction of variance as obtained from getFractionVariance
#dp=Shannon Entropy as obtained from getShannonEntropy
#N=number of components to plot
ScrePlot <- function(pi,dp,N=NA){
if(is.na(N)){
barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
else{
barplot(pi[c(1:N)],las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}For this analysis we will focus on the nuclear samples and use the different types of data collected so far i.e.Ā gene expression, PUD, IR and AltEx. A similar analysis can be done on the cytoplasmic fraction.
ge <- myE_gen
EX_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
IR_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])
rownames(EX_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="EX")]
rownames(IR_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="IR")]
#You need to remove NA's and NaN for the unsupervised analysis
tosel_ex <- which(apply(is.na(EX_nuc),1,sum)==0&apply(is.nan(EX_nuc),1,sum)==0)
tosel_ir <- which(apply(is.na(IR_nuc),1,sum)==0&apply(is.nan(IR_nuc),1,sum)==0)
EX_nuc <- EX_nuc[tosel_ex,]
IR_nuc <- IR_nuc[tosel_ir,]It is common in SVD analysis to remove the first component which captures noise == most variations from from systematic changes in basal gene expression between genes.
#SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
pi_1 <- getFractionVariance(mySVD=SVD_eset)
dp_1 <- getShannonEntropy(pi_1)
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
pi_2 <- getFractionVariance(mySVD=SVD_eset_GE)
dp_2 <- getShannonEntropy(pi_2)
par(mfrow=c(2,4))
ScrePlot(pi_1,dp_1,N=NA)
mtext(side=3,line=0,text="prior removing first component")
#Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering",las=1)
boxplot(dat1,outline=FALSE)
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
ScrePlot(pi_2,dp_2,N=NA)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering",las=1)
boxplot(datm,outline=FALSE)
#Effect on the genes distribution
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
We start by performing SVD on individual matrices and compare how information is captured in the different matrices.
layout(matrix(c(1:6),ncol=3,nrow=2,byrow=FALSE))
#1. SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="Gene Expression")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_GE),getShannonEntropy(getFractionVariance(SVD_eset_GE)))
mtext(side=3,line=0,text="Gene Expression")
#2. SVD on the AltEx
dat1 <- EX_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="AltEx")
mtext(side=3,line=1,text="Prior removing first component")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_Ex <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_Ex),getShannonEntropy(getFractionVariance(SVD_eset_Ex)))
mtext(side=3,line=1,text="After removing first component")
mtext(side=3,line=0,text="AltEx")
#3. SVD on the IR
dat1 <- IR_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_IR),getShannonEntropy(getFractionVariance(SVD_eset_IR)))
mtext(side=3,line=0,text="IR")
Letās now compare their first PCs to test who the time is captured by the different matrices in the different SVD.
#PCA plots
par(mfrow=c(3,3),mar=c(4,4,2,2))
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,2], SVD_eset_GE$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,2], SVD_eset_Ex$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,2], SVD_eset_IR$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)
Do you see any differences in the clustering of the samples using the different count matrices?
Letās now look into the dynamic over time of the different component:
myMean <- list( t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))) )
mySD <- list(t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))) )
#Plot component over time
days <- c(0,3,7,14,21,35)
cats <- c("Gene Expression","AltEx","IR","3UTR")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
for(j in c(1:3)){
for(i in c(1:3)){
MIN=min(0,min(myMean[[j]][i,]-mySD[[j]][i,]))
MAX=max(myMean[[j]][i,]+mySD[[j]][i,])
plot(days,myMean[[j]][i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
mtext(side=3,line=0,text=cats[j],cex=CEX)
grid()
abline(h=0,col="red",lty=2)
}
}
Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
Use hierarchical clustering analysis to test whether the time component is differentially captured by changes in gene expression, AltEx, and intron retention the nuclear fractions. ADDITIONAL INFO:
ge, EX_nuc, IR_nuc).info data-frame. This comment is just here to tell you that
if you want to compare the 3 clusterings, it is better that you consider
the same samples only. You may have noticed that in tab_psi you have
more samples than we have in the gene expression matrix āgeā. Therefore
to obtain the matrices EX_nuc and IR_nuc that are derived from tab_psi;
we should filter the samples to keep the same than in āgeā. But this has
been done for you in the part ā## Prepare relevant matricesā.par(mfrow=c(1,3),mar=c(1,0,3,1)) to put the three plots on
the same line.# Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.
# Prepare relevant matrices
#ge <- myE_gen
#EX_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
#IR_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])
library(ape)
CEX = 1.0
# Perform hierarchical clustering on gene expression matrix (ge)
hcl_ge <- hclust(dist(t(ge), method = "manhattan"), method = "ward.D")
# Perform hierarchical clustering on EX_nuc matrix
hcl_ex_nuc <- hclust(dist(t(EX_nuc), method = "manhattan"), method = "ward.D")
# Perform hierarchical clustering on IR_nuc matrix
hcl_ir_nuc <- hclust(dist(t(IR_nuc), method = "manhattan"), method = "ward.D")
# Plot the three dendrograms in a row. Tree-like diagram that represents the arrangement of objects or variables based on their similarity or dissimilarity.
par(mfrow = c(1, 3), mar = c(1, 0, 3, 1))
plot(as.phylo(hcl_ge), tip.color = mycols, cex = 1.0, label.offset = 0.001, no.margin = TRUE, use.edge.length = TRUE, direction = "rightwards", plot = TRUE, font = 1, main = "Gene Expression")
plot(as.phylo(hcl_ex_nuc), tip.color = mycols, cex = 1.0, label.offset = 0.001, no.margin = TRUE, use.edge.length = TRUE, direction = "rightwards", plot = TRUE, font = 1, main = "AltEx Nuclear")
plot(as.phylo(hcl_ir_nuc), tip.color = mycols, cex = 1.0, label.offset = 0.001, no.margin = TRUE, use.edge.length = TRUE, direction = "rightwards", plot = TRUE, font = 1, main = "Intron Retention Nuclear")
The trees we obtain indicate that the corresponding features (gene
expression, AltEx, or IR) capture temporal changes in gene regulation
across the studied time points. This reflects the intricate regulatory
mechanisms underlying cellular differentiation and maturation.
# Extract the left singular matrix from SVD results
U <- SVD_eset_IR$u
# Get the top n indices for positive and negative values in each column of U
pos_indices <- lapply(1:5, function(i) order(U[, i], decreasing = TRUE)[1:500])
neg_indices <- lapply(1:5, function(i) order(U[, i], decreasing = FALSE)[1:500])
# Extract the genes corresponding to the top indices
pos_genes <- lapply(pos_indices, function(indices) rownames(IR_nuc)[indices])
neg_genes <- lapply(neg_indices, function(indices) rownames(IR_nuc)[indices])# Define the color vector for all timepoints, repeating every 5 samples
mycols_days <- rep(c("#CCFF00", "#33CC33", "#669999", "#6699FF", "#3300FF", "#CC33CC"), each =4 )
# PC 1 and PC 2
par(mfrow= c(2,2))
for (i in 1:2) {
# For top positive genes
pos_genes_data <- IR_nuc[pos_genes[[i]], ]
boxplot(pos_genes_data, main = paste("Behavior of Positive Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)
# For top negative genes
neg_genes_data <- IR_nuc[neg_genes[[i]], ]
boxplot(neg_genes_data, main = paste("Behavior of Negative Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)
}# PC 3 and PC 4
par(mfrow= c(2,2))
for (i in 3:4) {
# For top positive genes
pos_genes_data <- IR_nuc[pos_genes[[i]], ]
boxplot(pos_genes_data, main = paste("Behavior of Positive Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)
# For top negative genes
neg_genes_data <- IR_nuc[neg_genes[[i]], ]
boxplot(neg_genes_data, main = paste("Behavior of Negative Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)
}# PC 5
par(mfrow= c(2,1))
i <- 5
# For top positive genes
pos_genes_data <- IR_nuc[pos_genes[[i]], ]
boxplot(pos_genes_data, main = paste("Behavior of Positive Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)
# For top negative genes
neg_genes_data <- IR_nuc[neg_genes[[i]], ]
boxplot(neg_genes_data, main = paste("Behavior of Negative Genes (PC", i, ") Over Time", sep = ""),
xlab = "Time", ylab = "PSI", col = mycols_days, border = "black", notch = TRUE)Hint: U left-singular matrix provides with loadings of the genes/transcripts in individual components.
From the plots corresponding to PC1 we can see that the first component discriminates very well the last timepoint, it provides the best visualization for the differences between the last timepoint with respect to the rest of the timepoints. We see a drop in IR in days 22 and 35 for the 500 most positively contributing genes, while we see an increase in the most negatively contributing genes (these differences are due to maturation, but the biological reason remains unknown). PC2 presents almost the same variation. As for PC3, we can say that itās the best component to see the change in PSI values from the first timepoint to the others, especially for the 500 most positively contributing genes. PC4 and PC5 show all timepoints uniformly, they are not relevant PCs.